Libraries
Download Data

Introduction

I chose to do my final project on the Titanic dataset. I chose this particular dataset because of its popularity among young data scientists. It is one of the easiest datasets to start with when learning to build a predictive model. I wanted to get familiar with it so that I can build one myself in the near future, and I also thought it would be a fun topic for the project.

My question is: What variables are most important for indicating whether or not someone survived the Titanic disaster?

Exploring Relationships

Make Categorical Data Numeric

## # A tibble: 6 × 8
##   Survived Pclass   Sex   Age SibSp Parch  Fare Embarked
##      <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1        0      3     1    22     1     0  7.25        1
## 2        1      1     2    38     1     0 71.3         2
## 3        1      3     2    26     0     0  7.92        1
## 4        1      1     2    35     1     0 53.1         1
## 5        0      3     1    35     0     0  8.05        1
## 6        0      1     1    54     0     0 51.9         1
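
The encoding shown in the output above can be sketched in base R as follows. This is a minimal sketch, not the project's original code: the first three rows of the printed table are used as sample data, the raw labels (`male`/`female`, `S`/`C`/`Q`) and the code 3 for Queenstown are assumptions, and `sex_codes` and `port_codes` are names made up for illustration.

```r
# Sample rows: the numeric codes in the printed table (rows 1 to 3) are
# mapped back to assumed raw labels here.
titanic <- data.frame(
  Survived = c(0, 1, 1),
  Sex      = c("male", "female", "female"),
  Embarked = c("S", "C", "S"),
  stringsAsFactors = FALSE
)

# Named lookup vectors: label -> numeric code.
sex_codes  <- c(male = 1, female = 2)
port_codes <- c(S = 1, C = 2, Q = 3)  # Q = 3 is an assumption

# Index each lookup vector by the label column to get the numeric codes.
titanic$Sex      <- unname(sex_codes[titanic$Sex])
titanic$Embarked <- unname(port_codes[titanic$Embarked])

print(titanic)
```

A lookup vector keeps the label-to-code mapping explicit, which makes the sign of each correlation easier to interpret later.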

Corrplot

Some correlations to note are:

-Class & Fare Price: Class is categorical and is split into 3 categories (1, 2, 3). Fare is numeric. There is a moderate negative correlation (-0.55) between the two variables. This indicates that as the fare increases, the class number decreases (moves toward 1st class). It is reasonable to assume that higher fares go with 1st class (1), but it is nice to see that the data supports it as well.

-Age & Class: There is a low negative correlation (-0.37). This indicates that as age increases, the class number decreases (closer to 1st class). One possible explanation is that older passengers had more money than younger passengers.

The corrplot shows a few relationships when it comes to the ‘Survived’ variable:

-Class: ‘Class’ has a low negative correlation (-0.36) with ‘Survived’. Class gets worse (economically) as the number increases (1st class is 1, 2nd is 2, and 3rd is 3), so wealthier people are more likely to be in first class. ‘Survived’ records whether the passenger died (0) or survived (1), so a higher value means survival. The correlation therefore indicates that survivors tend to have a lower class number. In other words, wealthier people (people in 1st class) were more likely to survive.

-Sex: ‘Sex’ has a moderate positive correlation (0.54). As sex increases (man (1) to woman (2)), so does survival (death (0) to survival (1)). This indicates that women were more likely to survive.
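
The correlations discussed above come from a correlation matrix over the numeric columns. A minimal base R sketch is shown here, using only the six rows printed earlier, so the values it produces will not match the figures quoted for the full dataset. The name `titanic_num` and the choice of `use = "complete.obs"` are assumptions for illustration.

```r
# First six rows of the numeric table printed earlier in the report.
titanic_num <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Pclass   = c(3, 1, 3, 1, 3, 1),
  Sex      = c(1, 2, 2, 2, 1, 1),
  Age      = c(22, 38, 26, 35, 35, 54),
  Fare     = c(7.25, 71.3, 7.92, 53.1, 8.05, 51.9)
)

# Pearson correlation between every pair of columns.
# use = "complete.obs" drops rows with missing values (an assumption about
# how missing ages were handled in the project).
cor_matrix <- cor(titanic_num, use = "complete.obs")

# Correlation of each variable with survival, rounded for reading.
print(round(cor_matrix["Survived", ], 2))
```

If the corrplot package is installed, passing `cor_matrix` to `corrplot::corrplot()` draws the matrix as a plot.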

Summary of Passengers

Age Categories:

-Baby (0-2)
-Toddler (2-5)
-Child (5-13)
-Teen (13-20)
-Adult (20-40)
-MAA, Middle-Aged Adult (40-60)
-Senior (60+)

This plot shows that most passengers on board were Adults (ages 20-40). Middle-Aged Adults are the second most common age category. The counts rise with each age category up to Adults and fall after that. There are also more male passengers than female passengers in just about every age category.
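
The age categories listed above can be assigned with `cut()`. This is a sketch under assumptions: the report does not say which category a boundary age such as 2 or 20 falls into, so the left-closed intervals used here (`right = FALSE`, meaning an age of exactly 2 counts as Toddler) are a guess, and the sample ages are illustrative.

```r
# Illustrative ages, including a missing value.
age <- c(1, 2, 22, 54, 70, NA)

# Breaks follow the category list; right = FALSE makes each interval
# [lower, upper), which is an assumption about boundary handling.
age_cat <- cut(
  age,
  breaks = c(0, 2, 5, 13, 20, 40, 60, Inf),
  labels = c("Baby", "Toddler", "Child", "Teen", "Adult", "MAA", "Senior"),
  right  = FALSE
)

# Missing ages stay NA, matching the <NA> entries in the report's tables.
print(data.frame(age, age_cat))
```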

This map shows the 3 ports where Titanic passengers embarked. Southampton is the port with the largest number of passengers. A little over half of the passengers from Southampton were in third class; the rest were split somewhat evenly between first and second class. The second largest port is Cherbourg, where just over half of the passengers were in first class. The smallest port is Queenstown, which had 77 passengers, 72 of them in third class.
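
The shares described above can be checked from the passenger counts per port and class, which are taken here from the port-by-class summary table in the Analysis section. This is only a sketch of the arithmetic; the names `counts`, `port_totals`, and `class_share` are made up for illustration.

```r
# Passengers per port (rows) and class (columns), copied from the
# port-by-class summary table in the Analysis section.
counts <- matrix(
  c(127, 164, 353,
     85,  17,  66,
      2,   3,  72),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Port  = c("Southampton", "Cherbourg", "Queenstown"),
    Class = c("First", "Second", "Third")
  )
)

# Total passengers per port.
port_totals <- rowSums(counts)

# Percentage of each port's passengers in each class (rows sum to 100).
class_share <- round(100 * prop.table(counts, margin = 1), 1)

print(port_totals)
print(class_share)
```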

The plot shows the number of passengers and the fare price for each class. Third class is the largest, first class is the second largest, and second class is the smallest. As expected, the price goes up as you get closer to first class.

Analysis

Demographics

This plot shows the survival rate of passengers by age category and gender. Females survived at a significantly higher rate than males. For women, the survival rate for babies and seniors appears to be 100%, and all other categories appear similar to one another. For men, the chances of survival decrease as age increases.
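
A survival rate by gender and age category can be computed with `tapply()`. The sketch below uses only the first ten rows of the labelled passenger table printed later in this section, so its rates are far too coarse to compare with the plot; it is meant to show the calculation, not the project's actual code.

```r
# First ten rows of the labelled passenger table (Survived, Sex, cat).
passengers <- data.frame(
  Survived = c("Died", "Survived", "Survived", "Survived", "Died",
               "Died", "Died", "Died", "Survived", "Survived"),
  Sex      = c("Male", "Female", "Female", "Female", "Male",
               "Male", "Male", "Male", "Female", "Female"),
  cat      = c("Adult", "Adult", "Adult", "Adult", "Adult",
               NA, "MAA", "Toddler", "Adult", "Teen"),
  stringsAsFactors = FALSE
)

# TRUE for survivors; the mean of a logical vector is the share of TRUEs.
survived_flag <- passengers$Survived == "Survived"

# Survival rate (%) by gender.
rate_by_sex <- 100 * tapply(survived_flag, passengers$Sex, mean)

# Survival rate (%) by gender and age category; rows with a missing
# category are dropped, and empty combinations come out as NA.
rate_by_group <- 100 * tapply(
  survived_flag, list(passengers$Sex, passengers$cat), mean
)

print(rate_by_sex)
print(rate_by_group)
```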

## # A tibble: 889 × 10
## # Groups:   Embarked [3]
##    Embarked  Pclass Survived Total…¹ Embar…²   lat  long Embar…³ Embar…⁴ Embar…⁵
##    <chr>      <dbl>    <dbl>   <int>   <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>
##  1 Cherbourg      1        0     168      93  49.6 -1.62      75    55.4    44.6
##  2 Cherbourg      1        1     168      93  49.6 -1.62      75    55.4    44.6
##  3 Cherbourg      1        1     168      93  49.6 -1.62      75    55.4    44.6
##  4 Cherbourg      3        0     168      93  49.6 -1.62      75    55.4    44.6
##  5 Cherbourg      1        1     168      93  49.6 -1.62      75    55.4    44.6
##  6 Cherbourg      1        1     168      93  49.6 -1.62      75    55.4    44.6
##  7 Cherbourg      3        1     168      93  49.6 -1.62      75    55.4    44.6
##  8 Cherbourg      2        0     168      93  49.6 -1.62      75    55.4    44.6
##  9 Cherbourg      3        0     168      93  49.6 -1.62      75    55.4    44.6
## 10 Cherbourg      1        1     168      93  49.6 -1.62      75    55.4    44.6
## # … with 879 more rows, and abbreviated variable names ¹​TotalEmbarked,
## #   ²​EmbarkedSurvived, ³​EmbarkedDied, ⁴​EmbarkedSRate, ⁵​EmbarkedDRate
## # A tibble: 9 × 6
## # Groups:   Embarked, Pclass [9]
##   Embarked    Pclass TotalEmbarkedPclass TotalEmbPclassSur TotalEmbPclas…¹ Death
##   <chr>       <chr>                <int>             <dbl>           <dbl> <dbl>
## 1 Southampton Third                  353                67            19.0  81.0
## 2 Cherbourg   First                   85                59            69.4  30.6
## 3 Southampton First                  127                74            58.3  41.7
## 4 Queenstown  Third                   72                27            37.5  62.5
## 5 Cherbourg   Second                  17                 9            52.9  47.1
## 6 Southampton Second                 164                76            46.3  53.7
## 7 Cherbourg   Third                   66                25            37.9  62.1
## 8 Queenstown  First                    2                 1            50    50  
## 9 Queenstown  Second                   3                 2            66.7  33.3
## # … with abbreviated variable name ¹​TotalEmbPclassSurPerc
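
The step from per-passenger rows to the port-by-class summary above can be sketched with `aggregate()`. This uses the first ten rows of the per-passenger table printed above (all from Cherbourg), so the counts are much smaller than in the summary. The column names `Total`, `SurvivedPct`, and `DeathPct` are simplified stand-ins, not the project's own names.

```r
# First ten per-passenger rows printed above (port, class, survival flag).
rows <- data.frame(
  Embarked = rep("Cherbourg", 10),
  Pclass   = c(1, 1, 1, 3, 1, 1, 3, 2, 3, 1),
  Survived = c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1)
)

# Passengers and survivors per port and class. Both calls use the same
# grouping, so their rows come back in the same order.
total    <- aggregate(Survived ~ Embarked + Pclass, data = rows, FUN = length)
survived <- aggregate(Survived ~ Embarked + Pclass, data = rows, FUN = sum)

summary_tbl <- data.frame(
  Embarked = total$Embarked,
  Pclass   = total$Pclass,
  Total    = total$Survived,
  Survived = survived$Survived
)

# Survival and death percentages within each port and class group.
summary_tbl$SurvivedPct <- round(100 * summary_tbl$Survived / summary_tbl$Total, 1)
summary_tbl$DeathPct    <- round(100 - summary_tbl$SurvivedPct, 1)

print(summary_tbl)
```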

## # A tibble: 891 × 5
##    Survived Pclass Sex    Embarked    cat    
##    <chr>    <chr>  <chr>  <chr>       <chr>  
##  1 Died     Third  Male   Southampton Adult  
##  2 Survived First  Female Cherbourg   Adult  
##  3 Survived Third  Female Southampton Adult  
##  4 Survived First  Female Southampton Adult  
##  5 Died     Third  Male   Southampton Adult  
##  6 Died     Third  Male   Queenstown  <NA>   
##  7 Died     First  Male   Southampton MAA    
##  8 Died     Third  Male   Southampton Toddler
##  9 Survived Third  Female Southampton Adult  
## 10 Survived Second Female Cherbourg   Teen   
## # … with 881 more rows
## # A tibble: 38 × 10
## # Groups:   cat, Sex, Pclass [38]
##    cat     Sex    Pclass  CSCT     S     D    SR    DR cat1               group1
##    <chr>   <chr>  <chr>  <int> <dbl> <dbl> <dbl> <dbl> <chr>              <chr> 
##  1 Teen    Female Second     8     8     0   100     0 Teen Female Second All S…
##  2 Toddler Female Second     4     4     0   100     0 Toddler Female Se… All S…
##  3 Child   Female Second     4     4     0   100     0 Child Female Seco… All S…
##  4 Baby    Male   Second     5     5     0   100     0 Baby Male Second   All S…
##  5 Teen    Female First     13    13     0   100     0 Teen Female First  All S…
##  6 Baby    Female Third      4     4     0   100     0 Baby Female Third  All S…
##  7 Toddler Male   Second     3     3     0   100     0 Toddler Male Seco… All S…
##  8 Senior  Female First      3     3     0   100     0 Senior Female Fir… All S…
##  9 Baby    Male   First      1     1     0   100     0 Baby Male First    All S…
## 10 Toddler Male   First      1     1     0   100     0 Toddler Male First All S…
## # … with 28 more rows